Efficient implementation of the DCT on custom computers

Authors

Neil W. Bergmann

Yuk Ying Chung

Bernard K. Gunther

Abstract

The discrete cosine transform (DCT) is a key step in many image and video-coding applications, and its efficient implementation has been extensively studied for software and for custom VLSI. In this paper, we analyse the use of the distributed arithmetic algorithm for the efficient implementation of the DCT in reconfigurable logic. One of our current research projects explores the use of custom computers for implementing video compression algorithms, and in particular the performance of such an approach to video processing in comparison with more conventional software or custom VLSI hardware techniques. Additionally, we wish to discern particular programming paradigms which are a good fit to the custom computing environment. Overall, we hope to learn when it is profitable to use custom computing techniques and, in those situations, how best to design the hardware system.

Intuition and experience suggest that the choice of an efficient hardware structure for an algorithm must take account of the peculiar resources available on a particular custom computer. Efficient use of FPGA resources requires that large global communication structures such as buses be kept to a minimum; local communications are preferred. Additionally, most FPGAs incorporate an optional state-holding latch with every programmable logic gate, allowing very fine-grain pipelining at low cost. Together, these two features (local communications and fine-grain pipelining) encourage the use of the systolic array parallel processing paradigm. Where possible, a continuous data stream is fed through a heavily pipelined, locally interconnected network of processing elements (PEs). This is one key programming paradigm which we have identified.

Our current custom computer (GigaOps Spectrum) employs Look Up Table (LUT) based FPGAs, where programmable logic functions are provided by setting the contents of many small ROMs or LUTs.
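As a rough illustration of what "setting the contents of a small LUT" means, the sketch below models a 4-input, 16-entry LUT in Python. The helper names (`make_lut4`, `lut4_read`) and the two example functions are illustrative assumptions, not part of the GigaOps or Xilinx tool flow.

```python
def make_lut4(func):
    """Tabulate a 4-input Boolean function into a 16-entry table (the LUT contents)."""
    table = []
    for addr in range(16):
        # Bit i of the address is the value of input i (input 0 is the LSB).
        a, b, c, d = [(addr >> i) & 1 for i in range(4)]
        table.append(func(a, b, c, d) & 1)
    return table


def lut4_read(table, a, b, c, d):
    """Evaluate the logic function by a single table look-up."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


# Two hypothetical example functions: the sum and carry bits of a full adder.
# The fourth input is unused, which a real LUT also permits.
SUM_LUT = make_lut4(lambda a, b, c, d: a ^ b ^ c)
CARRY_LUT = make_lut4(lambda a, b, c, d: (a & b) | (a & c) | (b & c))
```

Any function of four inputs costs the same single table, which is why an algorithm that maps naturally onto 16-entry tables can be expected to use such a device efficiently.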
If our hardware algorithm can be expressed more directly in terms of LUTs rather than logic gates (which would be automatically converted to these LUT structures), then additional implementation efficiency can be expected. To investigate this principle, we are completing the design of both a conventional and a LUT-optimised version of the two-dimensional Discrete Cosine Transform (2-D DCT), which is a key step in many image compression standards, such as MPEG, JPEG and H.261. An 8x8 pixel 2-D DCT can be decomposed into eight 1-D DCTs across the rows, followed by eight 1-D DCTs down the columns. Figure 1 shows a generic architecture for such a system. Each of the 16 1-D DCT operations is identical, and they can be done sequentially by a single processing element, or in parallel. We describe the design of a single 8-point 1-D DCT operation.

(Author affiliation: Bernard K. Gunther, Advanced Computer Research Centre, University of South Australia, The Levels, SA 5095, Australia.)

A vector of signal values [x_0, x_1, ..., x_7] is converted into a vector of transformed values [y_0, y_1, ..., y_7], where each output element is formed by a weighted sum of the input vector:

$$y_m = \sum_{k=0}^{7} a_{mk}\, x_k, \qquad \text{where } a_{mk} = b_m \cos\frac{(2k+1)\, m\, \pi}{16} \tag{1}$$

Using the distributed arithmetic method [1] to implement this transform requires 8 LUTs, each of 256 entries (8 address bits), plus 8 accumulators (one for each output). Symmetries in the coefficients a_{mk} mean that we can recode the 8 inputs into two vectors of 4 elements, with each output using only one vector:

$$[w_0, w_1, w_2, w_3] = [x_0 + x_7,\; x_1 + x_6,\; x_2 + x_5,\; x_3 + x_4] \tag{2}$$

$$[z_0, z_1, z_2, z_3] = [x_0 - x_7,\; x_1 - x_6,\; x_2 - x_5,\; x_3 - x_4] \tag{3}$$

$$y_m = \sum_{k=0}^{3} a_{mk}\, w_k \ \text{ for } m \text{ even}, \qquad y_m = \sum_{k=0}^{3} a_{mk}\, z_k \ \text{ for } m \text{ odd} \tag{4}$$

Using the distributed arithmetic method on this revised transform requires 8 LUTs of 16 entries (4 address bits) plus 8 accumulators (one for each output).
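The recoded transform can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the authors' circuit: the scale factors b_m are taken to be the usual orthonormal DCT-II constants (the abstract does not define them), LUT contents are floating-point rather than fixed-point words, and all names (`coeff`, `build_luts`, `dct8_da`, `BITS`) are hypothetical. Each output uses one 16-entry table addressed by one bit from each of the four recoded inputs, and an accumulator that weights the look-up by the bit position.

```python
import math

N = 8
# Two's-complement width of the recoded inputs. Sums of unsigned 8-bit
# pixels reach 510, which needs 10 signed bits (assumption about the data).
BITS = 10


def coeff(m, k):
    """a_mk = b_m * cos((2k+1) m pi / 16), with orthonormal b_m assumed."""
    b = math.sqrt(1.0 / N) if m == 0 else math.sqrt(2.0 / N)
    return b * math.cos((2 * k + 1) * m * math.pi / (2 * N))


def build_luts():
    """One 16-entry table per output: entry[addr] = sum of a_mk over set bits k."""
    return [
        [sum(coeff(m, k) for k in range(4) if (addr >> k) & 1) for addr in range(16)]
        for m in range(N)
    ]


LUTS = build_luts()


def dct8_direct(x):
    """Reference: the plain 8-term weighted sum for every output."""
    return [sum(coeff(m, k) * x[k] for k in range(N)) for m in range(N)]


def dct8_da(x):
    """Distributed-arithmetic evaluation using the sum/difference recoding."""
    w = [x[k] + x[7 - k] for k in range(4)]  # feeds the even outputs
    z = [x[k] - x[7 - k] for k in range(4)]  # feeds the odd outputs
    y = []
    for m in range(N):
        v = w if m % 2 == 0 else z
        acc = 0.0
        for j in range(BITS):  # one bit-plane per step, as in a bit-serial design
            addr = 0
            for k in range(4):
                addr |= ((v[k] >> j) & 1) << k  # bit j of each input forms the address
            term = LUTS[m][addr] * (1 << j)
            # The most significant bit carries negative weight in two's complement.
            acc += -term if j == BITS - 1 else term
        y.append(acc)
    return y
```

For integer inputs within the assumed range, `dct8_da(x)` should agree with `dct8_direct(x)` up to floating-point rounding. A hardware version would replace the multiplication by `1 << j` with a shift in the accumulator.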
Since our Xilinx LUT-based FPGAs employ 16-entry LUTs for most of their programmable logic gates, the recoded transform leads to a very efficient implementation: the logic to calculate each output is approximately the same size as a single serial multiplier. Figures 2 and 3 show the block diagrams for this circuit. Note the input shift register structure (P and Q in Figure 2), which converts a stream of g-bit samples into 8 bit-serial data streams. More details of this network are given in [2]. In a conventional "Fast DCT" network [3], a pipelined array of 13 multipliers and 29 adders is used to implement each 8x1 DCT, as shown in Figure 4. The effective size of the distributed arithmetic circuit is 8 multipliers plus 8 adders. We are currently confirming these area estimates, which show that the LUT-based approach requires only half the area, measured as the number of Xilinx CLBs.

Our next experimental step will be to deeply pipeline this architecture and evaluate the resulting area-time performance of a range of different pipelined implementations of the system. Results are expected to differ from conventional VLSI systems, since latching the output of a function in a CLB is cheap, whereas a latch without associated logic still consumes a whole CLB and is therefore expensive. With systolic computation, we hope to achieve continuous calculation of 8x8 DCTs for an input data stream at the digital TV sample rate of 13.5 MHz. Potential applications include prototyping new video compression algorithms, prototyping MPEG-II decoders (a high-volume part), and production real-time MPEG-II encoders (a low-volume market).

0-8186-8159-4/97 $10.00
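To put the 13.5 MHz target and the table sizes in perspective, the short calculation below derives the implied block rate and compares total LUT entries before and after recoding. These are back-of-envelope figures computed from numbers quoted in the abstract, not reported results; the constant names are hypothetical and continuous streaming of 64 samples per 8x8 block is assumed.

```python
SAMPLE_RATE_HZ = 13.5e6      # digital TV sample rate quoted in the abstract
SAMPLES_PER_BLOCK = 8 * 8    # one 8x8 pixel block
DCTS_1D_PER_BLOCK = 16       # 8 row transforms followed by 8 column transforms

blocks_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_BLOCK
dcts_1d_per_second = blocks_per_second * DCTS_1D_PER_BLOCK
ns_per_sample = 1e9 / SAMPLE_RATE_HZ

# Table storage for one 8-point 1-D DCT, in LUT entries.
entries_direct = 8 * 2 ** 8   # 8 LUTs with 8 address bits each
entries_recoded = 8 * 2 ** 4  # 8 LUTs with 4 address bits each

print(f"{blocks_per_second:.1f} blocks/s, {dcts_1d_per_second:.0f} 1-D DCTs/s")
print(f"{ns_per_sample:.2f} ns per sample")
print(f"LUT entries: {entries_direct} direct vs {entries_recoded} recoded")
```

Under these assumptions a single shared processing element would need to complete a 1-D DCT roughly every 296 ns, which gives a sense of why deep pipelining or parallel elements are of interest.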

Related resources

Accurate Fruits Fault Detection in Agricultural Goods using an Efficient Algorithm

The main purpose of this paper was to introduce an efficient algorithm for fault identification in fruit images. First, the input image was de-noised using a combination of Block Matching and 3D filtering (BM3D) and a Principal Component Analysis (PCA) model. Afterward, in order to reduce the size of the images and increase the execution speed, a refined Discrete Cosine Transform (DCT) algorithm was uti...

Full Custom VLSI Implementation of High-Speed 2-D DCT/IDCT Chip

In this paper we present a full-custom VLSI design of a high-speed 2-D DCT/IDCT processor based on the new class of time-recursive algorithms and architectures, which has never before been implemented to demonstrate its performance. We show that the VLSI implementation of this class of DCT/IDCT algorithms can easily meet the high-speed requirements of HDTV due to its modularity, regularity, local connecti...

Applying an XC6200 to Real-Time Image Processing

mentum over the past few years.1 A custom-computing machine (CCM) consists of a host processor such as a microprocessor connected to programmable hardware that implements the computationally complex part of a program. The concept arose from the fact that in microprocessor implementations, most computationally complex applications spend 90% of their execution time on only 10% of their code.2 Beca...

Designing a Custom Architecture for DCT Using NISC Design Flow

This paper presents the design of a custom architecture for the Discrete Cosine Transform (DCT) using the No-Instruction-Set Computer (NISC) design flow, which was developed for fast processor customization. Using several software transformations and hardware customizations, we achieved up to 10 times performance improvement, 2 times power reduction, 12.8 times energy reduction, and 3 times area reduction comp...

FPGA implementation of short critical path CORDIC-based approximation of the eight-point DCT

This paper presents an efficient approach to the multiplierless implementation of an eight-point DCT approximation, which is based on the coordinate rotation digital computer (CORDIC) algorithm. The main design objective is to shorten the critical path of the corresponding circuits and reduce the combinational delay of the proposed scheme. 1. INTRODUCTION It is well known that the discrete cosine transform (DCT) h...

Publication date: 1997
